CMPINF 2100 Week 08

Introduction to PCA to support Cluster Analysis

Import Modules

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

Read data

Start with the penguins dataset again.

In [3]:
penguins = sns.load_dataset("penguins")
In [4]:
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB

WHY PCA???

The penguins data has 4 numeric columns!!

In [5]:
sns.pairplot(data=penguins)

plt.show()

There is a clear relationship, or correlation structure, between several of the numeric columns!

In [7]:
fig, ax = plt.subplots()

sns.heatmap(data=penguins.corr(numeric_only=True),
            vmin=-1,
            vmax=1,
            center=0,
            annot=True,
            annot_kws={"fontsize":15},
           cmap="coolwarm")

plt.show()

BUT WHY PCA??

PCA tries to EXPLOIT correlation between variables. This is beneficial because maybe we do NOT actually need to look at ALL pairs of scatter plots!

If we can exploit the RELATIONSHIP between variables, maybe we can CREATE NEW variables that CAPTURE the impact or influence of ALL variables!!!

Then instead of having to explore a large number of figures, we can focus on the relationship between several NEWLY created variables!!!

PCA will be discussed in more detail in CMPINF 2120. We will also revisit PCA later in the semester in this course, CMPINF 2100. But for now let's just see how to USE PCA to support visualization.
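
To make the idea of EXPLOITING correlation concrete, here is a minimal sketch on made-up data (not the penguins). The two toy variables, the seed, and the noise level are assumptions chosen for illustration. `explained_variance_ratio_` is an attribute of a fitted scikit-learn `PCA` object that is not otherwise used in these notes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Two made-up variables that are strongly related: x2 is x1 plus a little noise.
x1 = rng.normal(size=500)
x2 = x1 + 0.2 * rng.normal(size=500)
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each newly created variable.
ratio = pca.explained_variance_ratio_

print("correlation between x1 and x2:", np.corrcoef(x1, x2)[0, 1])
print("variance share per component:", ratio)
```

When the two inputs are this strongly correlated, the first new variable should carry almost all of the variation, which is why a single axis can stand in for both originals.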

Execute PCA

Before executing PCA, we MUST deal with MISSINGS, such as DROPPING THEM!! Also, it is HIGHLY RECOMMENDED that you STANDARDIZE the variables BEFORE applying PCA!!!

In [8]:
pens_clean = penguins.dropna().copy()
In [9]:
pens_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB

STANDARDIZE using the StandardScaler class from scikit-learn.

In [10]:
from sklearn.preprocessing import StandardScaler

Standardize numeric columns

In [11]:
pens_clean_features = pens_clean.select_dtypes("number").copy()
In [12]:
Xpens = StandardScaler().fit_transform(pens_clean_features)
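
As a quick check of what standardizing does, the sketch below compares `StandardScaler` with the same calculation done by hand. The toy DataFrame and its column names are made up for illustration and are not the penguins data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numbers on very different scales (hypothetical columns).
toy = pd.DataFrame({"length_mm": [39.1, 46.5, 50.0, 36.7],
                    "mass_g": [3750.0, 4500.0, 5700.0, 3450.0]})

X_scaled = StandardScaler().fit_transform(toy)

# The same calculation by hand: subtract the column mean, divide by the
# column standard deviation. StandardScaler uses the population standard
# deviation, which is ddof=0 in pandas (the pandas default is ddof=1).
X_manual = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After standardizing, every column has mean 0 and standard deviation 1, so no column dominates just because of its units.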

We can use the PCA class from scikit-learn to execute the TRANSFORMATION!!!

The transformation produces NEW variables that ACCOUNT for the relationship between ALL of the original numeric variables!!

In [13]:
from sklearn.decomposition import PCA

PCA follows the logic of StandardScaler. We must:

  • INITIALIZE the object based on ASSUMPTIONS
  • FIT the object
  • TRANSFORM a data set using the FITTED object

The main assumption we need for PCA is the NUMBER OF COMPONENTS or the number of NEWLY CREATED VARIABLES to produce.
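
The three steps listed above can be sketched separately and compared with the one-line `fit_transform()` shortcut. The random array below is a stand-in for a standardized feature array, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)  # arbitrary seed
X = rng.normal(size=(50, 4))    # stand-in for 50 rows of 4 standardized features

pca = PCA(n_components=2)        # INITIALIZE with the number of components
pca.fit(X)                       # FIT the object to the data
scores_3step = pca.transform(X)  # TRANSFORM the data with the FITTED object

# The same thing in one line.
scores_1line = PCA(n_components=2).fit_transform(X)

print(scores_3step.shape)
# Compare absolute values, because the sign of a component is arbitrary.
print(np.allclose(np.abs(scores_3step), np.abs(scores_1line)))
```

Keeping the fitted object around, as in the three-step version, is useful when the same transformation must later be applied to another data set.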

We will NOT discuss how to decide the BEST number of new vars today. Instead, we will just focus on 2 because we will VISUALIZE 2 numeric variables via scatter plots!!

Apply PCA in 1 line of code by INITIALIZING then FITTING and TRANSFORMING!!!

In [14]:
pca_pens = PCA(n_components=2).fit_transform(Xpens)
In [15]:
type(pca_pens)
Out[15]:
numpy.ndarray
In [16]:
pca_pens
Out[16]:
array([[-1.85359302e+00,  3.20693765e-02],
       [-1.31625406e+00, -4.43526765e-01],
       [-1.37660509e+00, -1.61230478e-01],
       [-1.88528838e+00, -1.23512351e-02],
       [-1.91998074e+00,  8.17598126e-01],
       [-1.77302031e+00, -3.66222957e-01],
       [-8.18496250e-01,  5.01243084e-01],
       [-1.79895773e+00, -2.45393945e-01],
       [-1.95614892e+00,  9.98282895e-01],
       [-1.56952316e+00,  5.78081948e-01],
       [-1.74800122e+00, -6.10244291e-01],
       [-1.57577371e+00,  8.68357265e-02],
       [-8.04720190e-01,  1.29355592e+00],
       [-2.35017809e+00, -6.45191072e-01],
       [-1.00498645e+00,  1.97242251e+00],
       [-2.40824844e+00, -3.08968645e-01],
       [-2.11369825e+00, -1.36493144e-01],
       [-1.85705729e+00, -1.09144060e-01],
       [-1.50501042e+00, -2.89127997e-01],
       [-1.58113786e+00, -6.03932517e-01],
       [-1.92846722e+00, -2.97394981e-01],
       [-1.76295054e+00,  1.38259762e-01],
       [-1.70361341e+00, -1.87802307e-01],
       [-2.71417458e+00, -2.01106317e-01],
       [-1.68232816e+00,  2.85542330e-01],
       [-1.87994963e+00, -7.82580998e-01],
       [-1.91081367e+00, -4.06695073e-01],
       [-1.65683258e+00, -3.28286332e-01],
       [-1.51840291e+00,  3.26408242e-01],
       [-1.44646684e+00, -9.87685263e-01],
       [-1.44062410e+00,  1.05909586e+00],
       [-1.63466140e+00,  5.48223391e-01],
       [-1.73335112e+00,  2.72394506e-01],
       [-2.40765908e+00,  6.73451508e-02],
       [-1.13764744e+00,  3.57809820e-01],
       [-2.29657080e+00, -5.93801144e-01],
       [-9.71848773e-01,  1.17509989e-01],
       [-2.30890668e+00, -4.49404139e-01],
       [-5.78401946e-01,  1.05458646e+00],
       [-2.01067992e+00, -9.97271019e-01],
       [-8.80262620e-01,  2.12079200e-01],
       [-1.92925587e+00,  3.42881528e-01],
       [-1.78298528e+00, -6.57410953e-01],
       [-1.40940140e+00,  1.43826097e+00],
       [-1.57392895e+00, -3.39592411e-01],
       [-1.14654389e+00,  2.78170592e-01],
       [-1.86608339e+00, -7.67327681e-01],
       [-7.86733863e-01,  7.11147080e-01],
       [-2.44789222e+00, -7.94851225e-01],
       [-1.26418254e+00,  2.43767425e-01],
       [-1.54901519e+00, -4.81769739e-01],
       [-1.22044841e+00,  2.47154032e-01],
       [-2.25876529e+00, -1.18962297e+00],
       [-1.52359256e+00,  3.45359658e-02],
       [-2.01615696e+00, -1.12589726e+00],
       [-1.13641794e+00,  1.31328324e+00],
       [-1.57091360e+00, -8.33767737e-01],
       [-9.27431832e-01,  8.25272607e-02],
       [-2.24489579e+00, -9.96917698e-01],
       [-9.13660651e-01,  4.69928294e-02],
       [-1.34180687e+00, -1.40816257e+00],
       [-1.24076891e+00,  4.50049108e-01],
       [-1.80093377e+00, -1.23282996e+00],
       [-5.92025660e-01,  6.85886654e-01],
       [-2.11142026e+00, -4.72533742e-01],
       [-1.26934357e+00, -5.46639757e-03],
       [-1.02609887e+00, -5.33157420e-01],
       [-4.04478774e-01,  8.94152813e-01],
       [-1.57243857e+00, -8.50558396e-01],
       [-5.86662602e-01,  4.11120851e-01],
       [-9.40429442e-01, -5.40033074e-01],
       [-1.92733875e+00,  1.22172496e-01],
       [-1.45634863e+00, -1.35600020e+00],
       [-9.37516173e-01,  5.53350680e-01],
       [-1.96939557e+00, -1.11892242e+00],
       [-4.68328228e-02,  1.00901565e-01],
       [-1.79183530e+00, -1.84002791e-01],
       [-1.52578740e+00, -7.63992510e-02],
       [-1.68181280e+00, -5.64107174e-01],
       [-1.59639832e+00,  9.08101945e-01],
       [-1.84348437e+00,  5.67099175e-02],
       [-1.85729280e+00, -2.70705718e-01],
       [-1.55507132e+00,  1.68921592e-01],
       [-1.62210133e+00,  4.00341309e-02],
       [-1.26523377e+00, -6.35421240e-01],
       [-2.00393880e-01,  7.11886336e-02],
       [-2.02709537e+00, -1.20799740e+00],
       [-1.00562068e+00, -8.72793004e-02],
       [-1.87080090e+00, -8.93881281e-01],
       [-2.64027702e-01,  3.63384244e-01],
       [-1.57962369e+00, -1.19371375e-01],
       [-6.84823504e-01,  1.46252964e-01],
       [-2.52929467e+00, -1.76228161e+00],
       [-7.79625987e-01,  4.39581247e-01],
       [-1.59563939e+00, -7.40347063e-01],
       [-3.86175400e-01,  8.69122059e-01],
       [-1.80101966e+00, -1.27844482e+00],
       [-1.51265850e+00,  4.66837670e-01],
       [-2.00243546e+00, -2.13819033e-01],
       [-1.85740516e+00,  1.61221992e-01],
       [-8.48810852e-01, -6.22812685e-01],
       [-1.71870377e+00,  4.77518186e-01],
       [-1.98479380e+00, -8.20882689e-01],
       [-2.13534661e-01,  7.08300149e-01],
       [-7.38240082e-01, -9.54490503e-01],
       [-6.44875034e-01,  1.47936161e+00],
       [-1.48219852e+00, -3.54236566e-01],
       [-7.39940610e-01,  7.53287890e-01],
       [-1.70321099e+00,  9.15253811e-01],
       [-6.32808164e-01,  3.02917228e-01],
       [-1.84273238e+00, -7.89182569e-01],
       [-1.60946727e+00,  5.72883733e-01],
       [-1.73484802e+00, -1.06473097e+00],
       [-1.62792299e+00,  1.74301453e-01],
       [-1.95305685e+00, -9.48638014e-01],
       [-1.66339271e+00,  3.06844796e-01],
       [-1.82836539e+00, -5.65972121e-01],
       [-6.70854593e-01,  2.24468852e-01],
       [-1.88191000e+00, -1.59486466e+00],
       [-8.76999276e-01,  3.49638746e-01],
       [-1.56785175e+00, -4.87347294e-01],
       [-6.19917518e-01,  1.92001813e-01],
       [-1.60358507e+00, -6.89218352e-01],
       [ 7.01808888e-02,  3.33984567e-01],
       [-1.66069875e+00, -3.94507052e-01],
       [-1.13411289e+00,  6.57034153e-01],
       [-1.68043855e+00, -3.20534232e-01],
       [-7.08387339e-01, -1.48385164e-01],
       [-1.68833943e+00, -5.51677888e-01],
       [-9.70355138e-01, -2.16004046e-01],
       [-1.88183815e+00, -8.89082390e-01],
       [-1.10935310e+00,  7.49111595e-01],
       [-1.65603365e+00, -1.12119459e+00],
       [-8.04934115e-01, -1.73395579e-01],
       [-1.18214807e+00, -5.23204908e-01],
       [-1.36523239e+00, -4.34095818e-01],
       [-1.97590114e+00, -2.09674424e+00],
       [-1.02176382e+00, -4.79069974e-01],
       [-1.67693429e+00, -1.00189205e+00],
       [-1.76540033e+00,  1.32217563e-02],
       [-1.11219725e+00,  5.38438736e-02],
       [-2.06481174e+00, -3.89108766e-01],
       [-1.55660384e+00, -6.95834197e-01],
       [-1.34524467e+00, -3.48806583e-01],
       [-1.57336339e+00, -9.58805742e-01],
       [-6.18303403e-01,  2.46934847e-01],
       [-7.93836850e-01,  5.02297056e-01],
       [-3.89368970e-01,  1.57456101e+00],
       [-5.15027365e-01,  1.57096244e+00],
       [-1.19537904e+00,  7.06041689e-01],
       [-3.04312640e-01,  1.97658038e+00],
       [-3.26614065e-01,  3.64192175e-01],
       [-1.63592055e+00,  5.50237855e-01],
       [-7.88452183e-02,  1.17721487e+00],
       [-4.70293902e-01,  9.15308965e-01],
       [-4.16818937e-01,  1.86122420e+00],
       [-5.18914090e-01,  5.01742105e-01],
       [-5.78352198e-01,  2.07263418e+00],
       [-7.82308030e-01,  3.30433534e-01],
       [ 3.69588535e-01,  1.24385074e+00],
       [-7.12498702e-01,  1.18722739e-01],
       [-5.94770968e-02,  1.68634410e+00],
       [-8.34897001e-01,  1.75334376e+00],
       [-1.34571116e-01,  1.74031931e+00],
       [-1.06082686e+00,  7.69161630e-01],
       [ 1.08599515e-01,  1.00737973e+00],
       [-1.39779585e+00, -1.86348140e-01],
       [-6.56046746e-01,  5.50241661e-01],
       [-1.42052018e+00, -4.45944135e-01],
       [-5.11234668e-01,  1.58926869e+00],
       [-7.90299107e-01,  5.06500522e-01],
       [ 9.04349588e-02,  1.61612776e+00],
       [-3.01545221e-01,  1.13821856e+00],
       [-2.32942707e-01,  1.30929055e+00],
       [-6.86335466e-01,  4.69421229e-01],
       [ 5.57174822e-01,  2.15032356e+00],
       [-1.40654483e+00, -6.70221602e-01],
       [ 1.75368656e-01,  2.60270659e+00],
       [-1.19133191e+00, -4.39598103e-01],
       [ 2.61046726e-01,  1.42295499e+00],
       [-4.77965711e-01,  1.14822032e+00],
       [ 7.44911095e-02,  2.07746781e-01],
       [-4.20670546e-01,  8.19697344e-01],
       [ 7.25638767e-01,  2.37167261e+00],
       [-1.04370429e+00, -5.62049265e-02],
       [ 6.01454563e-01,  2.18201887e+00],
       [ 1.38759674e-01,  1.47518981e+00],
       [-8.41124399e-01,  3.19554637e-01],
       [-4.72686923e-01,  1.47823497e+00],
       [-5.29414382e-01,  2.96136491e-02],
       [-1.43693398e-01,  1.00422814e+00],
       [ 4.62160556e-01,  1.31195693e+00],
       [-6.45485399e-01,  8.87659748e-01],
       [ 4.40184372e-01,  1.54979441e+00],
       [-9.17707186e-01,  1.34996671e+00],
       [-3.08991857e-02,  6.41199552e-01],
       [-1.87582034e-01,  5.70474676e-02],
       [ 6.87115867e-02,  1.53281144e+00],
       [-6.28963576e-01,  1.81340228e-01],
       [ 1.92827122e-02,  1.74964587e+00],
       [-1.31309930e+00, -1.96650725e-01],
       [-3.30925312e-01,  1.49055629e+00],
       [-8.50169944e-01, -1.91170114e-01],
       [-1.37643774e-01,  1.67674491e+00],
       [-5.17501489e-02,  1.30607700e+00],
       [-1.07351711e+00,  1.01394523e+00],
       [ 2.14874699e-01,  1.79229394e+00],
       [-5.05885138e-01, -1.85804289e-02],
       [-4.51461631e-01,  6.54489015e-02],
       [ 5.53474521e-01,  2.34761163e+00],
       [-7.39913565e-01,  2.48154967e-01],
       [-3.67889760e-01,  9.91079624e-01],
       [ 4.92359602e-01,  1.48484928e+00],
       [-2.13416837e-01,  1.26155380e+00],
       [ 1.59356859e+00, -1.34179573e+00],
       [ 2.89205390e+00,  4.64090012e-01],
       [ 1.55157173e+00, -6.96759932e-01],
       [ 2.62068561e+00,  1.37233188e-02],
       [ 2.23455895e+00, -5.63287236e-01],
       [ 1.55889027e+00, -1.17201378e+00],
       [ 1.45637702e+00, -8.23329213e-01],
       [ 2.02554963e+00, -3.55648737e-01],
       [ 1.16950299e+00, -1.57891764e+00],
       [ 1.81451188e+00, -3.10575391e-01],
       [ 1.28618821e+00, -1.69540027e+00],
       [ 2.16995116e+00,  2.53134960e-01],
       [ 1.66843953e+00, -1.18978332e+00],
       [ 2.50595964e+00, -3.92893307e-01],
       [ 1.03819687e+00, -8.36838135e-01],
       [ 2.52237724e+00,  1.53089665e-01],
       [ 9.11480748e-01, -1.70468040e+00],
       [ 3.08806126e+00, -1.59072567e-02],
       [ 1.46071532e+00, -7.76714255e-01],
       [ 2.45853762e+00, -2.01291445e-01],
       [ 2.81995631e+00, -3.28714403e-01],
       [ 1.75334566e+00, -8.76120401e-01],
       [ 1.37704625e+00, -7.80126191e-01],
       [ 1.62341756e+00, -2.13079172e-01],
       [ 1.85465371e+00, -1.68481442e+00],
       [ 1.78304339e+00, -5.13745958e-01],
       [ 2.32062326e+00, -3.15071902e-01],
       [ 1.57198405e+00, -6.56470332e-01],
       [ 2.58027529e+00,  4.07762389e-02],
       [ 2.23324413e+00, -2.83702741e-01],
       [ 1.17069843e+00, -1.28141516e+00],
       [ 1.45779016e+00, -8.74674010e-01],
       [ 3.78701834e+00,  1.83601539e+00],
       [ 2.33349180e+00, -2.98646309e-01],
       [ 2.14182216e+00,  2.55556267e-01],
       [ 1.59133864e+00, -1.48042442e+00],
       [ 1.46271619e+00,  2.06122550e-01],
       [ 1.11168166e+00, -1.42616223e+00],
       [ 1.75972697e+00,  3.58655736e-02],
       [ 7.09891542e-01, -1.56660409e+00],
       [ 2.71361148e+00,  2.96581645e-01],
       [ 1.24766590e+00, -1.24670723e+00],
       [ 1.89611421e+00, -2.02401218e-01],
       [ 2.58249171e+00,  3.39509176e-01],
       [ 1.76453362e+00, -1.29262601e+00],
       [ 1.15532939e+00, -1.15325176e+00],
       [ 2.60359333e+00,  3.26484463e-01],
       [ 1.96619306e+00, -1.37531536e+00],
       [ 1.70292714e+00, -3.10211734e-01],
       [ 1.63023914e+00, -8.49052487e-01],
       [ 2.52824538e+00, -6.33769450e-01],
       [ 1.15735133e+00, -9.75741632e-01],
       [ 2.47953715e+00, -1.19944640e-01],
       [ 1.90404533e+00, -7.71411353e-01],
       [ 1.80265515e+00, -5.15867852e-01],
       [ 9.99994848e-01, -1.33142706e+00],
       [ 1.89119896e+00, -6.27629575e-01],
       [ 9.30919099e-01, -1.14016421e+00],
       [ 2.77838403e+00,  8.63973188e-02],
       [ 1.07656958e+00, -1.21655353e+00],
       [ 2.21598059e+00, -5.62234490e-01],
       [ 1.47355252e+00, -1.11059336e+00],
       [ 3.37817705e+00,  6.89442994e-01],
       [ 1.83216652e+00, -9.47529000e-01],
       [ 2.77396146e+00,  6.44562815e-01],
       [ 2.89794904e+00,  3.77737157e-01],
       [ 1.68225824e+00, -1.19992388e+00],
       [ 2.82297979e+00, -2.51494909e-03],
       [ 1.73822780e+00, -4.11243001e-01],
       [ 1.88543725e+00, -2.85343544e-01],
       [ 2.10338085e+00, -7.79830974e-02],
       [ 2.02796808e+00, -5.80915426e-01],
       [ 1.59601675e+00, -5.58889916e-01],
       [ 2.90496724e+00,  1.98243239e-01],
       [ 1.49289256e+00, -7.74316869e-01],
       [ 2.77638908e+00,  6.09393449e-01],
       [ 1.73279991e+00, -1.17234318e+00],
       [ 2.35528428e+00, -2.13839444e-03],
       [ 1.70570972e+00, -4.73358039e-01],
       [ 2.69998725e+00,  4.27945010e-01],
       [ 1.61251537e+00, -6.10214911e-01],
       [ 2.48664340e+00,  2.66357334e-01],
       [ 1.58421835e+00, -1.20655899e+00],
       [ 2.60478500e+00,  9.46598161e-01],
       [ 1.48255754e+00, -1.14027062e+00],
       [ 2.65819078e+00, -2.86338576e-01],
       [ 1.84514307e+00, -8.27905113e-01],
       [ 2.82194749e+00,  9.64088245e-01],
       [ 1.94077693e+00, -4.13378481e-01],
       [ 2.62497748e+00,  1.00047845e+00],
       [ 1.49201527e+00, -8.57170341e-01],
       [ 2.60960622e+00,  3.20912437e-01],
       [ 1.51912978e+00, -8.75767080e-01],
       [ 2.57359526e+00,  2.59869947e-01],
       [ 1.83678033e+00,  1.16188362e-01],
       [ 2.08569057e+00, -6.46771800e-01],
       [ 1.29687923e+00, -5.94513353e-01],
       [ 2.42913431e+00,  6.21116375e-01],
       [ 1.99672542e+00, -3.12558491e-01],
       [ 3.08946906e+00,  1.38569979e+00],
       [ 1.70781430e+00, -2.42760558e-01],
       [ 2.86192618e+00, -1.81068897e-01],
       [ 1.91173444e+00,  6.14939272e-03],
       [ 1.01903506e+00, -1.19945381e+00],
       [ 2.68593517e+00,  6.11780499e-01],
       [ 1.12616049e+00, -1.31974077e+00],
       [ 1.97540337e+00, -2.58352741e-01],
       [ 2.10123089e+00,  1.28213706e-03],
       [ 3.08631267e+00,  3.03503990e-01],
       [ 1.15660747e+00, -8.02661928e-01],
       [ 2.87996707e+00,  6.09944432e-01],
       [ 1.58107282e+00, -9.75789324e-01],
       [ 3.47928847e+00,  9.17457141e-01],
       [ 2.68799274e+00,  3.16920939e-01],
       [ 1.99771558e+00, -9.76771459e-01],
       [ 1.83265107e+00, -7.84509926e-01],
       [ 2.75150503e+00,  2.66555715e-01],
       [ 1.71385366e+00, -7.25875158e-01],
       [ 2.01853683e+00,  3.36553720e-01]])
In [17]:
pca_pens.shape
Out[17]:
(333, 2)
In [18]:
Xpens.shape
Out[18]:
(333, 4)
In [19]:
pens_clean_features.shape
Out[19]:
(333, 4)

Convert the NumPy array pca_pens into a DataFrame to support visualization.

Name the columns pc01 and pc02.

In [21]:
pca_pens_df = pd.DataFrame(pca_pens, columns=["pc01", "pc02"])
In [22]:
pca_pens_df
Out[22]:
pc01 pc02
0 -1.853593 0.032069
1 -1.316254 -0.443527
2 -1.376605 -0.161230
3 -1.885288 -0.012351
4 -1.919981 0.817598
... ... ...
328 1.997716 -0.976771
329 1.832651 -0.784510
330 2.751505 0.266556
331 1.713854 -0.725875
332 2.018537 0.336554

333 rows × 2 columns

Visualize the relationship between these two NEWLY created variables as a scatter plot.

In [23]:
sns.relplot(data=pca_pens_df, x="pc01", y="pc02")

plt.show()

Let's calculate the CORRELATION MATRIX between these two new variables!

In [31]:
fig, ax = plt.subplots()

sns.heatmap(pca_pens_df.corr(), 
            vmin=-1,
            vmax=1,
            center=0,
            fmt=".3f",
            cmap="coolwarm",
            cbar=False,
            annot=True,
            annot_kws={"fontsize": 15},
            ax=ax)

plt.show()
In [32]:
sns.lmplot(data=pca_pens_df, x="pc01", y="pc02")

plt.show()
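
The newly created PCA variables are constructed to be UNCORRELATED with each other. Here is a minimal numeric check on made-up correlated data. The data generation and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)  # arbitrary seed

# Four made-up columns that all share one common signal, so they are correlated.
common = rng.normal(size=(200, 1))
raw = common + 0.5 * rng.normal(size=(200, 4))

X = StandardScaler().fit_transform(raw)
scores = pd.DataFrame(PCA(n_components=2).fit_transform(X),
                      columns=["pc01", "pc02"])

# Correlation matrix of the two new variables.
corr = scores.corr()
print(corr.round(3))
```

The off-diagonal entries should be zero up to floating point error, even though the original columns are strongly correlated.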

But we can include GROUPING variables with our PCA!!!

In [33]:
# Assign by POSITION: pca_pens_df has a fresh RangeIndex, but pens_clean kept
# its original row labels after dropna(), so aligning on labels leaves NaNs.
pca_pens_df["species"] = pens_clean.species.to_numpy()
In [36]:
sns.lmplot(data=pca_pens_df, x="pc01", y="pc02", hue="species")

plt.show()

We saw it was easy to SEPARATE the penguins data into 2 clusters!!

The NEWLY CREATED PCA variables EASILY identify the 2 PRIMARY GROUPS in the data!!!!

Clustering and PCA

Instead of visualizing the clustering results on the original variables, let's visualize the CLUSTERING results with the NEWLY created PCA variables!!!

In [38]:
from sklearn.cluster import KMeans
In [42]:
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict(Xpens)
In [44]:
pca_pens_df['k2'] = pd.Series(clusters_2, index=pca_pens_df.index).astype("category")
In [45]:
pca_pens_df
Out[45]:
pc01 pc02 species k2
0 -1.853593 0.032069 Adelie 0
1 -1.316254 -0.443527 Adelie 0
2 -1.376605 -0.161230 Adelie 0
3 -1.885288 -0.012351 Adelie 0
4 -1.919981 0.817598 Adelie 0
... ... ... ... ...
328 1.997716 -0.976771 Gentoo 1
329 1.832651 -0.784510 Gentoo 1
330 2.751505 0.266556 Gentoo 1
331 1.713854 -0.725875 Gentoo 1
332 2.018537 0.336554 Gentoo 1

333 rows × 4 columns

In [46]:
pca_pens_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   pc01     333 non-null    float64 
 1   pc02     333 non-null    float64 
 2   species  333 non-null    object  
 3   k2       333 non-null    category
dtypes: category(1), float64(2), object(1)
memory usage: 8.4+ KB
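
A note on adding a column from one DataFrame to another: pandas aligns a Series on its index LABELS, not on row position. The sketch below uses two tiny made-up frames (hypothetical names `left` and `right`) to show what happens when one frame has a gap in its index, as a frame does after `dropna()`.

```python
import pandas as pd

# Toy frame with a gap in its index (label 1 is missing), like after dropna().
left = pd.DataFrame({"group": ["a", "b", "c"]}, index=[0, 2, 3])

# A frame built from a NumPy array or a list gets a fresh 0..n-1 index.
right = pd.DataFrame({"value": [10.0, 20.0, 30.0]})

# Assigning a Series aligns on index LABELS: label 1 has no match and becomes
# NaN, and label 2 picks up "b" even though "b" is the second row of left.
right["by_label"] = left["group"]

# Assigning the underlying array ignores labels and aligns by POSITION.
right["by_position"] = left["group"].to_numpy()

print(right)
```

Checking `.info()` or `.isna().sum()` right after adding a column is a cheap way to catch this kind of misalignment.
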
In [47]:
sns.relplot(data=pca_pens_df, x="pc01", y="pc02", hue="k2")

plt.show()
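
Besides coloring the scatter plot, a cross-tabulation is a simple way to compare cluster labels against a known grouping. Here is a sketch on synthetic data from `make_blobs` (not the penguins). The blob centers, the sample size, and the column name `true_group` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two well separated synthetic groups in 4 dimensions (100 rows each).
X_raw, true_group = make_blobs(n_samples=200,
                               centers=[[-5, -5, -5, -5], [5, 5, 5, 5]],
                               cluster_std=1.0,
                               random_state=0)

X = StandardScaler().fit_transform(X_raw)

# Cluster on the standardized features, then use PCA only for the display axes.
labels = KMeans(n_clusters=2, random_state=121, n_init=25).fit_predict(X)
scores = PCA(n_components=2).fit_transform(X)

df = pd.DataFrame(scores, columns=["pc01", "pc02"])
df["k2"] = pd.Series(labels, index=df.index).astype("category")
df["true_group"] = true_group

# Rows: known group. Columns: cluster label found by KMeans.
table = pd.crosstab(df["true_group"], df["k2"])
print(table)
```

The cluster numbers themselves are arbitrary, so what matters in the table is whether each known group falls almost entirely into a single cluster.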

Larger example

On Canvas, there is a WINE DATA SET.

In [48]:
wine_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
In [50]:
wine_names = ['Cultivar', 'Alcohol', 'Malic_acid', 'Ash', 'Alcalinity_of_ash', 'Magnesium', 'Total_phenols', 
              'Flavanoids', 'Nonflavanoid_phenols', 'Proanthocyanin', 'Color_intensity', 'Hue', 'OD280_OD315', 'Proline']
In [51]:
wine_data = pd.read_csv(wine_url, names=wine_names)
In [52]:
wine_data
Out[52]:
Cultivar Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanin Color_intensity Hue OD280_OD315 Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740
174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750
175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835
176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840
177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560

178 rows × 14 columns

In [53]:
wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Cultivar              178 non-null    int64  
 1   Alcohol               178 non-null    float64
 2   Malic_acid            178 non-null    float64
 3   Ash                   178 non-null    float64
 4   Alcalinity_of_ash     178 non-null    float64
 5   Magnesium             178 non-null    int64  
 6   Total_phenols         178 non-null    float64
 7   Flavanoids            178 non-null    float64
 8   Nonflavanoid_phenols  178 non-null    float64
 9   Proanthocyanin        178 non-null    float64
 10  Color_intensity       178 non-null    float64
 11  Hue                   178 non-null    float64
 12  OD280_OD315           178 non-null    float64
 13  Proline               178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB
In [54]:
wine_data.isna().sum()
Out[54]:
Cultivar                0
Alcohol                 0
Malic_acid              0
Ash                     0
Alcalinity_of_ash       0
Magnesium               0
Total_phenols           0
Flavanoids              0
Nonflavanoid_phenols    0
Proanthocyanin          0
Color_intensity         0
Hue                     0
OD280_OD315             0
Proline                 0
dtype: int64
In [56]:
wine_data.Cultivar.value_counts()
Out[56]:
Cultivar
2    71
1    59
3    48
Name: count, dtype: int64

Convert Cultivar to a categorical variable.

In [58]:
wine_data["Cultivar"] = wine_data.Cultivar.astype("category")
In [59]:
wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Cultivar              178 non-null    category
 1   Alcohol               178 non-null    float64 
 2   Malic_acid            178 non-null    float64 
 3   Ash                   178 non-null    float64 
 4   Alcalinity_of_ash     178 non-null    float64 
 5   Magnesium             178 non-null    int64   
 6   Total_phenols         178 non-null    float64 
 7   Flavanoids            178 non-null    float64 
 8   Nonflavanoid_phenols  178 non-null    float64 
 9   Proanthocyanin        178 non-null    float64 
 10  Color_intensity       178 non-null    float64 
 11  Hue                   178 non-null    float64 
 12  OD280_OD315           178 non-null    float64 
 13  Proline               178 non-null    int64   
dtypes: category(1), float64(11), int64(2)
memory usage: 18.5 KB

Why will PCA help here?

We could make a PAIRS PLOT between all 13 numeric columns...
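
To get a sense of how large that figure is, the quick arithmetic below counts the panels. It assumes the usual square pair plot layout, with one panel per ordered pair of columns and distributions on the diagonal.

```python
from math import comb

n_numeric = 13  # number of numeric columns in the wine data

# Distinct pairs of variables, each needing its own scatter plot.
pair_panels = comb(n_numeric, 2)

# Total panels in the square grid, including the diagonal and mirrored pairs.
grid_panels = n_numeric ** 2

print(pair_panels, grid_panels)
```

With 2 PCA variables there is a single scatter plot to study instead.
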

In [61]:
sns.pairplot(data=wine_data, 
             hue="Cultivar",
             diag_kws={"common_norm":False})

plt.show()

PCA allows us to EXPLOIT relationships between columns!!

We can check whether there are relationships by creating CORRELATION PLOTS!!!

In [68]:
fig, ax = plt.subplots()

sns.heatmap(data=wine_data.corr(numeric_only=True),
            vmin=-1,
            vmax=1,
            center=0,
            cbar=False,
            cmap="coolwarm",
            annot=True,
            annot_kws={"fontsize":7},
            ax=ax)

plt.show()

BEFORE we execute PCA, we need to check the MAGNITUDE and SCALES!!!

In [69]:
sns.catplot(data=wine_data, kind="box", aspect=2)

plt.show()
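
Why do the scales matter? The sketch below uses two made-up columns, one with a small spread and one with a large spread, and compares the weights of the first PCA variable with and without standardizing. The means and spreads are assumptions loosely inspired by the very different ranges in the wine data, not values taken from it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)  # arbitrary seed

# Two independent made-up columns on very different scales.
small = rng.normal(loc=2.0, scale=0.5, size=300)
large = rng.normal(loc=750.0, scale=300.0, size=300)
X_raw = np.column_stack([small, large])

# Weights (loadings) of the first component WITHOUT standardizing.
raw_weights = PCA(n_components=1).fit(X_raw).components_[0]

# Weights of the first component AFTER standardizing.
X_std = StandardScaler().fit_transform(X_raw)
scaled_weights = PCA(n_components=1).fit(X_std).components_[0]

print("raw:   ", np.abs(raw_weights).round(3))
print("scaled:", np.abs(scaled_weights).round(3))
```

Without standardizing, the first component is essentially just the large-scale column. After standardizing, both columns contribute with equal weight.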

Preprocess or STANDARDIZE the data

In [77]:
wine_data_features = wine_data.select_dtypes("number").copy()
In [78]:
wine_data_features
Out[78]:
Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanin Color_intensity Hue OD280_OD315 Proline
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740
174 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750
175 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835
176 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840
177 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560

178 rows × 13 columns

In [79]:
Xwine = StandardScaler().fit_transform(wine_data_features)
In [80]:
Xwine
Out[80]:
array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,
         1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,
         1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,
         0.78858745,  1.39514818],
       ...,
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515,
        -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176,
        -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837,
        -1.42894777, -0.59516041]])
In [83]:
sns.catplot(data=pd.DataFrame(Xwine, columns=wine_data_features.columns), kind="box", aspect=3)

plt.show()

Execute PCA and return 2 newly created variables!!

In [84]:
pca_wine = PCA(n_components=2).fit_transform(Xwine)
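
The whole workflow can also be sketched end to end without a download. This is an assumption-heavy stand-in: it uses the copy of the wine data bundled with scikit-learn (`load_wine`), whose column names differ from `wine_names` above and whose cultivar labels are coded 0, 1, 2 rather than 1, 2, 3. The names `wine_pca_df`, `features`, and `cultivar` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled copy of the wine data: 13 numeric features plus a class label.
wine = load_wine(as_frame=True)
features = wine.data
cultivar = wine.target

# STANDARDIZE, then create 2 new PCA variables.
X = StandardScaler().fit_transform(features)
scores = PCA(n_components=2).fit_transform(X)

# Collect the new variables and attach the grouping variable by POSITION.
wine_pca_df = pd.DataFrame(scores, columns=["pc01", "pc02"])
wine_pca_df["Cultivar"] = pd.Series(cultivar.to_numpy(),
                                    index=wine_pca_df.index).astype("category")

print(wine_pca_df.shape)
print(wine_pca_df["Cultivar"].value_counts())
```

From here, a scatter plot of pc01 against pc02 colored by Cultivar replaces the full pairs plot of all 13 columns.
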
In [85]:
pca_wine
Out[85]:
array([[ 3.31675081, -1.44346263],
       [ 2.20946492,  0.33339289],
       [ 2.51674015, -1.0311513 ],
       [ 3.75706561, -2.75637191],
       [ 1.00890849, -0.86983082],
       [ 3.05025392, -2.12240111],
       [ 2.44908967, -1.17485013],
       [ 2.05943687, -1.60896307],
       [ 2.5108743 , -0.91807096],
       [ 2.75362819, -0.78943767],
       [ 3.47973668, -1.30233324],
       [ 1.7547529 , -0.61197723],
       [ 2.11346234, -0.67570634],
       [ 3.45815682, -1.13062988],
       [ 4.31278391, -2.09597558],
       [ 2.3051882 , -1.66255173],
       [ 2.17195527, -2.32730534],
       [ 1.89897118, -1.63136888],
       [ 3.54198508, -2.51834367],
       [ 2.0845222 , -1.06113799],
       [ 3.12440254, -0.78689711],
       [ 1.08657007, -0.24174355],
       [ 2.53522408,  0.09184062],
       [ 1.64498834,  0.51627893],
       [ 1.76157587,  0.31714893],
       [ 0.9900791 , -0.94066734],
       [ 1.77527763, -0.68617513],
       [ 1.23542396,  0.08980704],
       [ 2.18840633, -0.68956962],
       [ 2.25610898, -0.19146194],
       [ 2.50022003, -1.24083383],
       [ 2.67741105, -1.47187365],
       [ 1.62857912, -0.05270445],
       [ 1.90269086, -1.63306043],
       [ 1.41038853, -0.69793432],
       [ 1.90382623, -0.17671095],
       [ 1.38486223, -0.65863985],
       [ 1.12220741, -0.11410976],
       [ 1.5021945 ,  0.76943201],
       [ 2.52980109, -1.80300198],
       [ 2.58809543, -0.7796163 ],
       [ 0.66848199, -0.16996094],
       [ 3.07080699, -1.15591896],
       [ 0.46220914, -0.33074213],
       [ 2.10135193,  0.07100892],
       [ 1.13616618, -1.77710739],
       [ 2.72660096, -1.19133469],
       [ 2.82133927, -0.6462586 ],
       [ 2.00985085, -1.24702946],
       [ 2.7074913 , -1.75196741],
       [ 3.21491747, -0.16699199],
       [ 2.85895983, -0.7452788 ],
       [ 3.50560436, -1.61273386],
       [ 2.22479138, -1.875168  ],
       [ 2.14698782, -1.01675154],
       [ 2.46932948, -1.32900831],
       [ 2.74151791, -1.43654878],
       [ 2.17374092, -1.21219984],
       [ 3.13938015, -1.73157912],
       [-0.92858197,  3.07348616],
       [-1.54248014,  1.38144351],
       [-1.83624976,  0.82998412],
       [ 0.03060683,  1.26278614],
       [ 2.05026161,  1.9250326 ],
       [-0.60968083,  1.90805881],
       [ 0.90022784,  0.76391147],
       [ 2.24850719,  1.88459248],
       [ 0.18338403,  2.42714611],
       [-0.81280503,  0.22051399],
       [ 1.9756205 ,  1.40328323],
       [-1.57221622,  0.88498314],
       [ 1.65768181,  0.9567122 ],
       [-0.72537239,  1.0636454 ],
       [ 2.56222717, -0.26019855],
       [ 1.83256757,  1.2878782 ],
       [-0.8679929 ,  2.44410119],
       [ 0.3700144 ,  2.15390698],
       [-1.45737704,  1.38335177],
       [ 1.26293085,  0.77084953],
       [ 0.37615037,  1.0270434 ],
       [ 0.7620639 ,  3.37505381],
       [ 1.03457797,  1.45070974],
       [-0.49487676,  2.38124353],
       [-2.53897708,  0.08744336],
       [ 0.83532015,  1.47367055],
       [ 0.78790461,  2.02662652],
       [-0.80683216,  2.23383039],
       [-0.55804262,  2.37298543],
       [-1.11511104,  1.80224719],
       [-0.55572283,  2.65754004],
       [-1.34928528,  2.11800147],
       [-1.56448261,  1.85221452],
       [-1.93255561,  1.55949546],
       [ 0.74666594,  2.31293171],
       [ 0.95745536,  2.22352843],
       [ 2.54386518, -0.16927402],
       [-0.54395259,  0.36892655],
       [ 1.03104975,  2.56556935],
       [ 2.25190942,  1.43274138],
       [ 1.41021602,  2.16619177],
       [ 0.79771979,  2.3769488 ],
       [-0.54953173,  2.29312864],
       [-0.16117374,  1.16448332],
       [-0.65979494,  2.67996119],
       [ 0.39235441,  2.09873171],
       [-1.77249908,  1.71728847],
       [-0.36626736,  2.1693533 ],
       [-1.62067257,  1.35558339],
       [ 0.08253578,  2.30623459],
       [ 1.57827507,  1.46203429],
       [ 1.42056925,  1.41820664],
       [-0.27870275,  1.93056809],
       [-1.30314497,  0.76317231],
       [-0.45707187,  2.26941561],
       [-0.49418585,  1.93904505],
       [ 0.48207441,  3.87178385],
       [-0.25288888,  2.82149237],
       [-0.10722764,  1.92892204],
       [-2.4330126 ,  1.25714104],
       [-0.55108954,  2.22216155],
       [ 0.73962193,  1.40895667],
       [ 1.33632173, -0.25333693],
       [-1.177087  ,  0.66396684],
       [-0.46233501,  0.61828818],
       [ 0.97847408,  1.4455705 ],
       [-0.09680973,  2.10999799],
       [ 0.03848715,  1.26676211],
       [-1.5971585 ,  1.20814357],
       [-0.47956492,  1.93884066],
       [-1.79283347,  1.1502881 ],
       [-1.32710166, -0.17038923],
       [-2.38450083, -0.37458261],
       [-2.9369401 , -0.26386183],
       [-2.14681113, -0.36825495],
       [-2.36986949,  0.45963481],
       [-3.06384157, -0.35341284],
       [-3.91575378, -0.15458252],
       [-3.93646339, -0.65968723],
       [-3.09427612, -0.34884276],
       [-2.37447163, -0.29198035],
       [-2.77881295, -0.28680487],
       [-2.28656128, -0.37250784],
       [-2.98563349, -0.48921791],
       [-2.3751947 , -0.48233372],
       [-2.20986553, -1.1600525 ],
       [-2.625621  , -0.56316076],
       [-4.28063878, -0.64967096],
       [-3.58264137, -1.27270275],
       [-2.80706372, -1.57053379],
       [-2.89965933, -2.04105701],
       [-2.32073698, -2.35636608],
       [-2.54983095, -2.04528309],
       [-1.81254128, -1.52764595],
       [-2.76014464, -2.13893235],
       [-2.7371505 , -0.40988627],
       [-3.60486887, -1.80238422],
       [-2.889826  , -1.92521861],
       [-3.39215608, -1.31187639],
       [-1.0481819 , -3.51508969],
       [-1.60991228, -2.40663816],
       [-3.14313097, -0.73816104],
       [-2.2401569 , -1.17546529],
       [-2.84767378, -0.55604397],
       [-2.59749706, -0.69796554],
       [-2.94929937, -1.55530896],
       [-3.53003227, -0.8825268 ],
       [-2.40611054, -2.59235618],
       [-2.92908473, -1.27444695],
       [-2.18141278, -2.07753731],
       [-2.38092779, -2.58866743],
       [-3.21161722,  0.2512491 ],
       [-3.67791872, -0.84774784],
       [-2.4655558 , -2.1937983 ],
       [-3.37052415, -2.21628914],
       [-2.60195585, -1.75722935],
       [-2.67783946, -2.76089913],
       [-2.38701709, -2.29734668],
       [-3.20875816, -2.76891957]])
In [86]:
pca_wine_df = pd.DataFrame(pca_wine, columns=["pc01", "pc02"])
In [87]:
pca_wine_df
Out[87]:
pc01 pc02
0 3.316751 -1.443463
1 2.209465 0.333393
2 2.516740 -1.031151
3 3.757066 -2.756372
4 1.008908 -0.869831
... ... ...
173 -3.370524 -2.216289
174 -2.601956 -1.757229
175 -2.677839 -2.760899
176 -2.387017 -2.297347
177 -3.208758 -2.768920

178 rows × 2 columns
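
What is `fit_transform` actually computing here? The following is a minimal NumPy sketch, assuming a small synthetic matrix rather than the wine features (the names `latent`, `mixing`, and `scores` are made up for illustration). It centers the data, takes a singular value decomposition, and keeps the first 2 columns of scores. The signs of the columns may differ from what a library returns, so treat this as an illustration and not as the library's implementation.

```python
import numpy as np

# Synthetic stand-in for a standardized feature matrix (NOT the wine data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via the singular value decomposition of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
# The NEW variables: projections of each row onto the first 2 directions.
scores = U[:, :n_components] * S[:n_components]

# Share of the total variance captured by each component.
explained_var = S**2 / (Xc.shape[0] - 1)
explained_ratio = explained_var / explained_var.sum()

print(scores.shape)
print(np.round(np.corrcoef(scores, rowvar=False), 6))
print(np.round(explained_ratio, 3))
```

The correlation matrix of the scores is the identity up to rounding: the new variables are uncorrelated with each other even though the original columns were not.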

In [88]:
sns.relplot(data=pca_wine_df, x="pc01", y="pc02")

plt.show()
No description has been provided for this image

Let's add the KNOWN groupings.

In [89]:
pca_wine_df["Cultivar"] = wine_data.Cultivar
In [90]:
pca_wine_df
Out[90]:
pc01 pc02 Cultivar
0 3.316751 -1.443463 1
1 2.209465 0.333393 1
2 2.516740 -1.031151 1
3 3.757066 -2.756372 1
4 1.008908 -0.869831 1
... ... ... ...
173 -3.370524 -2.216289 3
174 -2.601956 -1.757229 3
175 -2.677839 -2.760899 3
176 -2.387017 -2.297347 3
177 -3.208758 -2.768920 3

178 rows × 3 columns
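
A word of caution about assigning a column from another DataFrame: pandas aligns on the INDEX, not on row position. In the cell above the two indexes presumably match, so nothing goes wrong. The sketch below uses two small hypothetical frames (`source` and `scores` are made-up names) to show what could happen if the source frame had gaps in its index, for example after dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical source frame whose index has gaps (e.g. after dropping rows).
source = pd.DataFrame({"group": ["a", "b", "c", "d"]}, index=[0, 2, 3, 5])

# A frame built from a NumPy array gets a fresh RangeIndex 0..3.
scores = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=["pc01", "pc02"])

# Assigning a Series aligns on index labels: label 1 has no match.
aligned = scores.copy()
aligned["group"] = source["group"]

# Assigning the underlying array goes purely by row position.
positional = scores.copy()
positional["group"] = source["group"].to_numpy()

print(aligned)
print(positional)
```

When the row ORDER is known to match but the index labels might not, assigning the array (or resetting the index first) avoids silently introduced missing values.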

In [91]:
sns.relplot(data=pca_wine_df, x="pc01", y="pc02", hue="Cultivar")

plt.show()
No description has been provided for this image

Run KMeans with 3 clusters and visualize the 3 cluster labels using the NEWLY CREATED PCA variables!!!

In [92]:
clusters_3 = KMeans(n_clusters=3, random_state=121, n_init=25, max_iter=500).fit_predict(Xwine)
In [93]:
clusters_3
Out[93]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=int32)
In [97]:
pca_wine_df["k3"] = pd.Series(clusters_3, index=pca_wine_df.index).astype("category")
In [98]:
pca_wine_df
Out[98]:
pc01 pc02 Cultivar k3
0 3.316751 -1.443463 1 1
1 2.209465 0.333393 1 1
2 2.516740 -1.031151 1 1
3 3.757066 -2.756372 1 1
4 1.008908 -0.869831 1 1
... ... ... ... ...
173 -3.370524 -2.216289 3 0
174 -2.601956 -1.757229 3 0
175 -2.677839 -2.760899 3 0
176 -2.387017 -2.297347 3 0
177 -3.208758 -2.768920 3 0

178 rows × 4 columns

In [99]:
sns.relplot(data=pca_wine_df, x="pc01", y="pc02", hue="k3", style="Cultivar")

plt.show()
No description has been provided for this image
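
The plot compares cluster labels to known groups visually. A cross-tabulation gives the same comparison as counts. The following is a sketch on synthetic, well-separated blobs and NOT on the wine data; `known`, `labels`, and `summary` are names made up for this example, and the KMeans arguments simply mirror the call above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 known groups of 50 rows each (NOT the wine data).
X, known = make_blobs(
    n_samples=150,
    centers=[[-8, -8], [0, 8], [8, -8]],
    cluster_std=1.0,
    random_state=0,
)

labels = KMeans(n_clusters=3, random_state=121, n_init=25, max_iter=500).fit_predict(X)

# Rows: known group. Columns: cluster label found by KMeans.
summary = pd.crosstab(
    pd.Series(known, name="known"),
    pd.Series(labels, name="cluster"),
)

print(summary)
```

Cluster numbers are arbitrary, so cluster 0 does not have to correspond to the first known group. What matters is whether each known group lands mostly in a single column.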

A really big example¶

Use the Sonar data!

In [100]:
sonar_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
In [101]:
sonar_df = pd.read_csv( sonar_url, header=None )
In [102]:
sonar_df.shape
Out[102]:
(208, 61)

Convert the integer column labels to zero-padded string names.

In [105]:
sonar_df.columns = ["X%02d" % d for d in sonar_df.columns]
In [106]:
sonar_df.columns
Out[106]:
Index(['X00', 'X01', 'X02', 'X03', 'X04', 'X05', 'X06', 'X07', 'X08', 'X09',
       'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19',
       'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29',
       'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39',
       'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49',
       'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59',
       'X60'],
      dtype='object')
In [107]:
sonar_df.nunique()
Out[107]:
X00    177
X01    182
X02    190
X03    181
X04    193
      ... 
X56    121
X57    124
X58    119
X59    109
X60      2
Length: 61, dtype: int64
In [108]:
sonar_df.X60.value_counts()
Out[108]:
X60
M    111
R     97
Name: count, dtype: int64
In [110]:
sonar_df.isna().sum().max()
Out[110]:
0

Let's look at the correlation structure between ALL numeric columns!

In [112]:
fig, ax = plt.subplots()

sns.heatmap(sonar_df.corr(numeric_only=True),
            vmin=-1, 
            vmax=1,
            center=0,
            cmap="coolwarm",
            ax=ax)

plt.show()
No description has been provided for this image

Even though there are 60 numeric columns, many of the variables are HIGHLY CORRELATED!!
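
A heatmap with this many columns is hard to read cell by cell, so it can help to COUNT the strongly correlated pairs. The following is a sketch on a small synthetic frame and NOT on the sonar data; the 0.7 cutoff and the names `pairs` and `strong` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: neighbouring columns share signal (NOT the sonar data).
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 6)).cumsum(axis=1)
df = pd.DataFrame(base, columns=[f"X{j:02d}" for j in range(6)])

corr = df.corr(numeric_only=True)

# Keep each pair once by taking the strictly upper triangle.
rows, cols = np.triu_indices_from(corr.to_numpy(), k=1)
pairs = pd.Series(
    corr.to_numpy()[rows, cols],
    index=[f"{corr.index[r]}~{corr.columns[c]}" for r, c in zip(rows, cols)],
)

threshold = 0.7  # arbitrary cutoff chosen for illustration
strong = pairs[pairs.abs() > threshold].sort_values(ascending=False)

print(f"{len(strong)} of {len(pairs)} pairs exceed |r| > {threshold}")
print(strong.head())
```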

Let's exploit the correlation through PCA!!

But first, we must check the scales!!

In [114]:
sns.catplot(data=sonar_df, kind="box", aspect=3)

plt.show()
No description has been provided for this image

The SCALES are NOT the same across columns, so we need to standardize.

In [115]:
sonar_features = sonar_df.select_dtypes("number").copy()
In [116]:
Xsonar = StandardScaler().fit_transform(sonar_features)
In [117]:
Xsonar
Out[117]:
array([[-0.39955135, -0.04064823, -0.02692565, ...,  0.06987027,
         0.17167808, -0.65894689],
       [ 0.70353822,  0.42163039,  1.05561832, ..., -0.47240644,
        -0.44455424, -0.41985233],
       [-0.12922901,  0.60106749,  1.72340448, ...,  1.30935987,
         0.25276128,  0.25758223],
       ...,
       [ 1.00438083,  0.16007801, -0.67384349, ...,  0.90652575,
        -0.03913824, -0.67887143],
       [ 0.04953255, -0.09539176,  0.13480381, ..., -0.00759783,
        -0.70402047, -0.34015415],
       [-0.13794908, -0.06497869, -0.78861924, ..., -0.6738235 ,
        -0.29860448,  0.99479044]])
In [121]:
sns.catplot(data=pd.DataFrame(Xsonar, columns=sonar_features.columns), kind="box", aspect=3)

plt.show()
No description has been provided for this image
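
Standardizing means subtracting each column's mean and dividing by its standard deviation. The following sketch does that by hand on a small synthetic frame (NOT the sonar data; the column names are made up) and checks the result numerically instead of with a boxplot. It uses `ddof=0`, the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Synthetic columns on very different scales (NOT the sonar data).
rng = np.random.default_rng(2)
raw = pd.DataFrame({
    "small": rng.uniform(0.0, 0.05, size=100),
    "medium": rng.uniform(0.0, 1.0, size=100),
    "large": rng.normal(50.0, 10.0, size=100),
})

# Subtract the column mean, divide by the population standard deviation.
standardized = (raw - raw.mean()) / raw.std(ddof=0)

print(raw.agg(["mean", "std"]).round(3))
print(standardized.agg(["mean", "std"]).round(3))
```

After standardizing, every column has mean 0 and unit spread, so no single column dominates the PCA just because of its units.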

APPLY PCA and return 2 NEWLY Created Variables to support visualization!!

In [122]:
sonar_pca = PCA(n_components=2).fit_transform(Xsonar)
In [123]:
sonar_pca
Out[123]:
array([[ 1.92116817, -1.37089312],
       [-0.48012458,  7.58638801],
       [ 3.8592282 ,  6.43986016],
       [ 4.59741943, -3.10408888],
       [-0.53386761,  1.84984701],
       [-1.24701593,  3.78548414],
       [ 1.87007312,  2.49551038],
       [-2.05769816,  2.3147504 ],
       [-1.64556277,  0.25372155],
       [-4.28065736, -2.42781795],
       [-1.46164351, -6.32305562],
       [-2.46394888, -1.2537634 ],
       [-3.99546982,  1.64506244],
       [ 0.6370814 , -0.63741683],
       [-0.10539302, -0.25210417],
       [ 2.11242307,  0.59393523],
       [ 4.39574903, -2.25749069],
       [ 1.43859617,  1.90219042],
       [-1.03943408, -3.29436397],
       [-1.16485881,  8.59655069],
       [ 2.64812566,  1.66803742],
       [ 6.23535677, -1.47389049],
       [11.23389579, -2.75609298],
       [-0.24732176, -4.86351661],
       [ 2.65154822, -4.39934635],
       [-0.42203896, -7.16826626],
       [-3.69919995,  2.49392786],
       [-2.90589296,  0.16356259],
       [-1.8957691 ,  1.49786172],
       [-2.38880313,  1.37815246],
       [-2.32050849, -1.198227  ],
       [-3.50572573, -0.58086138],
       [ 0.04322219,  0.36634604],
       [ 1.0292047 ,  0.06587682],
       [-0.68903218,  1.11801579],
       [-1.9337308 ,  0.63038558],
       [ 0.26804541, -3.41912075],
       [-2.12333945, -4.50443015],
       [-2.58654933, -4.9379112 ],
       [ 0.16018513, -3.83652922],
       [-0.88614897, -3.47589265],
       [-1.59765115, -4.41483069],
       [ 1.48360897, -4.35709934],
       [ 1.44697712, -2.29112772],
       [ 8.48668013, -2.31220758],
       [ 1.27737697,  0.93420479],
       [ 0.08854593,  2.81051365],
       [ 2.06883858, -3.43706227],
       [-0.78138299, -1.61233704],
       [-0.31856672, -1.11443574],
       [ 3.95584037, -4.66138876],
       [-2.0951562 , -3.05837102],
       [-4.58738357, -1.74537817],
       [-2.66763757, -2.46386482],
       [-2.57370302, -2.84072656],
       [-2.98845891, -3.06711747],
       [-3.701205  , -3.04206607],
       [-1.61290675, -4.24162035],
       [-3.66320186, -4.09758109],
       [-3.95156859, -3.95691785],
       [-4.28849494, -3.67404719],
       [-3.99022937, -2.92534594],
       [-3.02795242, -4.61601499],
       [-2.06113673, -4.8504963 ],
       [ 3.89602028, -5.01129356],
       [ 4.66868527, -4.75463343],
       [ 4.27249405, -5.52137645],
       [ 4.78482642, -4.44051836],
       [ 0.02676699, -5.50724361],
       [-3.54961304, -3.08357687],
       [-3.76737904, -3.78225243],
       [-3.59432514, -3.07303315],
       [-3.62792705, -2.73985894],
       [-3.06532371, -3.174952  ],
       [-6.22064858,  1.15407451],
       [-5.09714812,  1.82131978],
       [-5.34001492,  2.02732079],
       [-4.56483724,  2.21245894],
       [-5.09218846,  2.4294874 ],
       [-5.51236077,  0.72473133],
       [-2.97181487,  3.43011963],
       [-1.73504223,  3.28751766],
       [-1.76972577,  2.62471132],
       [-2.32289342,  3.58772744],
       [-1.87966015,  3.65487563],
       [ 1.61109246,  4.64124374],
       [ 0.01326033,  1.5468855 ],
       [ 2.29479443,  1.45945475],
       [-0.42317649,  3.35109663],
       [ 0.29213153,  1.36445907],
       [ 0.06419184,  3.47830589],
       [ 0.16473305,  2.67225937],
       [-1.69516157, -3.03886575],
       [-0.24328844, -1.059707  ],
       [-1.70203039, -2.61915102],
       [ 0.44918644, -2.36052645],
       [ 1.21856416, -3.23833273],
       [ 6.36782773,  0.73642288],
       [ 5.86566105,  6.22201134],
       [ 2.32811749,  1.35622522],
       [ 0.81340327,  7.90835107],
       [ 0.92296306,  7.28755554],
       [ 1.32066581,  6.51343355],
       [-1.93304948,  5.04964033],
       [ 0.89214574,  5.77325882],
       [-1.93311548,  4.98293974],
       [ 0.16878754,  1.31765581],
       [-0.15581201,  2.39484391],
       [-1.1262658 ,  2.64804865],
       [-0.91894902,  1.06356863],
       [-1.86588194,  1.58062903],
       [ 1.37994883,  6.21905437],
       [-0.01613465,  5.26499384],
       [-1.61769778,  3.23288493],
       [-1.67074261,  4.08643755],
       [ 0.0870092 ,  4.47552792],
       [-2.34873512,  4.3751504 ],
       [-2.14290612,  1.8136091 ],
       [-3.01477158,  1.36962041],
       [-3.54250316,  2.37796744],
       [-2.8669399 , -0.1595899 ],
       [-3.03420417,  1.5630714 ],
       [-3.03508243,  2.89658961],
       [-2.21909884,  2.20827759],
       [-2.05969086,  4.05921872],
       [-1.55005899,  4.7435195 ],
       [-1.19864919,  7.38946477],
       [-0.06868931,  7.7256891 ],
       [-1.40862679,  6.62963748],
       [ 0.67776489,  7.96673913],
       [ 0.38189881,  8.77442069],
       [ 7.62980234,  1.86948666],
       [ 6.91257294,  3.10473738],
       [ 8.73901258,  2.71886885],
       [ 7.29487999,  2.78439377],
       [ 9.01527493,  3.31204677],
       [ 9.14299203,  2.76810222],
       [ 2.58584412,  2.14440232],
       [ 4.72295305,  2.60215625],
       [ 1.19188425, -1.00996638],
       [ 8.01516938,  0.42759914],
       [ 5.45084813,  0.98131544],
       [ 7.37874929, -0.53429043],
       [ 6.75153404, -1.06718757],
       [ 5.47649138, -1.67658293],
       [ 6.16730829,  0.97947431],
       [11.72743348,  0.46614983],
       [ 8.60415575,  4.6141592 ],
       [-0.12347172,  3.43192712],
       [-0.60637782,  0.77910874],
       [-3.56763182,  1.33976644],
       [-2.38544066,  1.95848913],
       [-1.15220818, -0.67390243],
       [-3.22063973,  2.6750803 ],
       [-4.09421487,  1.95705408],
       [-2.89979522,  1.60819509],
       [ 3.6264736 , -4.91118104],
       [ 5.92788489, -5.43400112],
       [ 4.93422552, -4.55292399],
       [ 4.7195731 , -4.25764292],
       [ 3.41729015, -4.84215461],
       [ 6.84896068, -5.60595577],
       [-0.83074146, -2.66580792],
       [-2.43810895, -0.7604101 ],
       [ 3.3151307 , -1.2477695 ],
       [-0.56576422, -1.21374219],
       [ 4.84375027, -4.1113877 ],
       [ 1.09065734, -4.36237998],
       [-1.50595678, -3.00786655],
       [-1.1813717 , -3.23331227],
       [-0.58844272, -3.56534756],
       [ 0.66200435,  1.82805944],
       [ 0.09440455, -1.55637496],
       [-2.41074113,  1.10272446],
       [ 0.17077759,  1.29275173],
       [-2.01648931,  0.28502622],
       [-1.43510345,  2.15770986],
       [-2.63642837, -1.32459419],
       [-2.37989072, -2.75099781],
       [-0.59518874, -1.40836032],
       [ 0.17127407,  0.80912805],
       [ 4.07344151, -1.32171421],
       [ 2.0892225 , -0.39495668],
       [ 2.18537714,  0.18559798],
       [ 1.72078994,  2.76819856],
       [ 0.9634835 ,  0.59891783],
       [ 5.33447453, -1.83583008],
       [ 0.75309666, -2.40472434],
       [-0.57828812, -2.91229356],
       [-1.6041077 , -1.89033976],
       [-1.24218934, -2.49121324],
       [-2.04098389, -2.4827779 ],
       [-2.3234503 , -2.29857771],
       [-1.75482615, -3.38829039],
       [-3.14194761, -2.36712914],
       [-3.08302426, -1.23110785],
       [-3.86592726, -0.58675473],
       [-3.61880911, -1.34121736],
       [-3.48250759, -1.15015708],
       [-3.94549651, -0.70515166],
       [-3.13198027,  0.18397254],
       [-3.61423572,  0.15117433],
       [-1.84562154, -0.88930777],
       [-1.20765295, -0.9681736 ],
       [-2.97143919, -2.75349246],
       [-2.29321041, -2.75544556],
       [-3.11446433, -1.85054952],
       [-3.23862419, -2.27709396]])
In [124]:
sonar_pca_df = pd.DataFrame(sonar_pca, columns=["pc01", "pc02"])
In [125]:
sonar_pca_df
Out[125]:
pc01 pc02
0 1.921168 -1.370893
1 -0.480125 7.586388
2 3.859228 6.439860
3 4.597419 -3.104089
4 -0.533868 1.849847
... ... ...
203 -1.207653 -0.968174
204 -2.971439 -2.753492
205 -2.293210 -2.755446
206 -3.114464 -1.850550
207 -3.238624 -2.277094

208 rows × 2 columns
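
Two new variables are convenient for plotting, but how much of the original variation do they keep? The following is a sketch using the `explained_variance_ratio_` attribute of a fitted `PCA` object, on synthetic data and NOT on the sonar features. The 90% target and the names `cumulative` and `n_needed` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 20 correlated columns (NOT the sonar data).
rng = np.random.default_rng(3)
latent = rng.normal(size=(208, 3))
features = latent @ rng.normal(size=(3, 20)) + 0.5 * rng.normal(size=(208, 20))

X = StandardScaler().fit_transform(features)

# Keep ALL components so the ratios sum to 1.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(np.round(pca.explained_variance_ratio_[:5], 3))
print(f"first 2 components keep {cumulative[1]:.1%} of the variance")

# Smallest number of components whose cumulative share reaches 90%.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_needed)
```

If the first 2 components keep only a modest share, the 2D scatter plot is still useful for exploration, but structure may be hiding in the components that were dropped.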

In [127]:
sonar_pca_df["X60"] = sonar_df.X60
In [128]:
sonar_pca_df
Out[128]:
pc01 pc02 X60
0 1.921168 -1.370893 R
1 -0.480125 7.586388 R
2 3.859228 6.439860 R
3 4.597419 -3.104089 R
4 -0.533868 1.849847 R
... ... ... ...
203 -1.207653 -0.968174 M
204 -2.971439 -2.753492 M
205 -2.293210 -2.755446 M
206 -3.114464 -1.850550 M
207 -3.238624 -2.277094 M

208 rows × 3 columns

In [130]:
sns.relplot(data=sonar_pca_df, x="pc01", y="pc02")

plt.show()
No description has been provided for this image

Color by the categorical variable.

In [132]:
sns.relplot(data=sonar_pca_df, x="pc01", y="pc02", hue="X60", palette="Set1")

plt.show()
No description has been provided for this image